Sentence Alignment for Spanish-Basque Bitexts: Word Correspondences vs. Markup Similarity
Abstract
In this paper, we present an evaluation of two different sentence alignment techniques. One is the well-known SIMR algorithm based on word correspondences on both sides of a bitext. The other one is the ALINOR algorithm, which is based on the similarity of the markup on both sides of a bitext. Both algorithms are accurate in 1-1 alignment, but ALINOR works slightly better in the case of N-M alignment.
Similar resources
Bitext Correspondences through Rich Mark-up
Rich mark-up can considerably benefit the process of establishing bitext correspondences, that is, the task of providing correct identification and alignment methods for text segments that are translation equivalences of each other in a parallel corpus. We present a sentence alignment algorithm that, by taking advantage of previously annotated texts, obtains accuracy rates close to 100%. The al...
Aligning tagged bitexts
This paper describes how complementary techniques can be employed to align multiword expressions in a parallel corpus or bitext. The bitext used for experimentation has two main features: (i) it contains bilingual documents from a dedicated domain of legal and administrative publications rich in specialized jargon; (ii) it involves two languages, Spanish and Basque, which are typologically very...
Identifying Complex Sound Correspondences in Bilingual Wordlists
The determination of recurrent sound correspondences between languages is crucial for the identification of cognates, which are often employed in statistical machine translation for sentence and word alignment. In this paper, an algorithm designed for extracting non-compositional compounds from bitexts is shown to be capable of determining complex sound correspondences in bilingual wordlists. I...
Improved Word-Level Alignment: Injecting Knowledge about MT Divergences
Word-level alignments of bilingual text (bitexts) are not only an integral part of statistical machine translation models, but also useful for lexical acquisition, treebank construction, and part-of-speech tagging. The frequent occurrence of divergences, structural differences between languages, presents a great challenge to the ...
Computational Lexicography and Lexicology Elexbi, a Basic Tool for Bilingual Term Extraction from Spanish-Basque Parallel Corpora
We present the work done by Elhuyar Foundation in the field of bilingual terminology extraction. The aim of this work is to develop some techniques for the automatic extraction of pairs of equivalent terms from Spanish-Basque translation memories, and to implement those techniques in a prototype. Our approach is based on a monolingual extraction of term candidates in each language, then the creati...